expert demonstration
Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both both quantitative metrics and user studies.
Prompt Tuning Decision Transformers with Structured and Scalable Bandits
Prompt tuning has emerged as a key technique for adapting large pre-trained Decision Transformers (DTs) in offline Reinforcement Learning (RL), particularly in multi-task and few-shot settings. The Prompting Decision Transformer (PDT) enables task generalization via trajectory prompts sampled uniformly from expert demonstrations - without accounting for prompt informativeness. In this work, we propose a bandit-based prompt-tuning method that learns to construct optimal trajectory prompts from demonstration data at inference time. We devise a structured bandit architecture operating in the trajectory prompt space, achieving linear rather than combinatorial scaling with prompt size. Additionally, we show that the pretrained PDT itself can serve as a powerful feature extractor for the bandit, enabling efficient reward modeling across various environments. We theoretically establish regret bounds and demonstrate empirically that our method consistently enhances performance across a wide range of tasks, high-dimensional environments, and out-of-distribution scenarios, outperforming existing baselines in prompt tuning.
Human assisted Robotic Policy Refinement via Action Preference Optimization
Establishing a reliable and iteratively refined robotic system is essential for deploying real-world applications. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their reliance on offline expert demonstrations critically limits their capacity for postdeployment refinement. To mitigate this limitation, we introduce Action Preference Optimization (APO), a method designed to refine VLA models by human-assisted preference alignment gathered through interaction with environments. This method begins with a human-robot collaboration framework for reliable failure correction and interaction trajectory collection through human intervention. However, directly leveraging these interaction trajectories for preference optimization is non-trivial due to the challenges of irreversible robotic actions and token distribution mismatch. To solve this, APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction, empowering VLA models effectively suppress failure-prone actions while enhancing corrective action adaptation. Ultimately, APO equips VLA models with the crucial capability to learn from failure, paving the way for their iterative refinement and reliable deployment in dynamic environments. The experiments conducted in simulation and real-world scenarios prove superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. We believe this work could bring insights for efficient and stable optimization of VLA models through human-robot collaboration.
Labeled DatasetLarge Unlabeled Dataset
This paper addresses the problem of learning avoidance behavior within the context of offline imitation learning. In contrast to conventional methodologies that prioritize the replication of expert or near-expert demonstrations, our work investigates a setting where expert (or desirable) data is absent, and the objective is to learn to eschew undesirable actions by leveraging demonstrations of such behavior (i.e., learning from negative examples). To address this challenge, we propose a novel training objective grounded in the maximum entropy principle. We further characterize the fundamental properties of this objective function, reformulating the learning process as a cooperative inverse Q-learning task. Moreover, we introduce an efficient strategy for the integration of unlabeled data (i.e., data of indeterminate quality) to facilitate unbiased and practical offline training. The efficacy of our method is evaluated across standard benchmark environments, where it consistently outperforms state-of-the-art baselines.
Text-Aware Diffusion for Policy Learning
Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations. However, supporting novel goals or behaviors through reinforcement learning requires the ad-hoc design of appropriate reward functions, which quickly becomes intractable. To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in alignment with a notion of naturalness summarized from internet-scale training data. In our experiments, we demonstrate that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments. The behaviors are learned zero-shot without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation. We further show that TADPoLe performs competitively when applied to robotic manipulation tasks in the Meta-World environment, without having access to any in-domain demonstrations.
Minimax Optimal Online Imitation Learning via Replay Estimation
Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with H2/Nexp for behavioral cloning and H/ p Nexp for online moment matching, where H is the horizon and Nexp is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of parametric function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e.
204904e461002b28511d5880e1c36a0f-Supplemental.pdf
Similarly to [6], we consider that all environments have the same underlying Structural Causal Model (SCM) and that the different environments correspond to different interventions on the SCM. We provide here the formal definition for SCMs and interventions. We say that Xi causes Xj if Xi 2Pa(Xj). Definition A.2. (Intervention) [6]: Consider a SCMC =( S,N). An intervention e on C consists of replacing one or several of its structural equations to obtain an intervened SCMCe =( Se,N e) with structural equations: Sej: Xej fj(Pa(Xej),N ej), for j =1,...m (11) The variable Xe is intervened on if Si 6= Sei or Ni 6= Nei .